データの全体像¶

train:1,482,535個のユーザーが投稿した商品

test:693,359行で「価格(Price)」の項目がテストデータは含まれていないため、列数は「7」となっている

  • train_id / test _id – ユーザー投稿のID
  • name – 投稿のタイトル。タイトルに価格に関する情報がある場合(例:$20)はメルカリが事前に削除をして[rm]と置き換えている。
  • item_condition_id – ユーザーが指定した商品の状態 (1~5,大きい方が状態が良い)
  • category_name – 投稿カテゴリー
  • brand_name – ブランドの名前
  • price – 訓練データのみ。実際に売られた価格。米ドル表示。今回のチャレンジの予測ターゲットとなる。
  • shipping – 送料のフラグ。「1」は販売者負担。「0」は購入者負担。
  • item_description – ユーザーが投稿した商品説明の全文。タイトルと同様に価格情報がある場合は[rm]と置き換えられている。

ライブラリのインポート¶

In [1]:
import re
import string
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline

データの読み込み¶

In [2]:
%%time
# Kaggle Notebook
train = pd.read_csv("/kaggle/input/mercari-dataset/train.tsv", delimiter='\t')
test = pd.read_csv("/kaggle/input/mercari-dataset/test.tsv", delimiter='\t')

# Local
# train = pd.read_csv("train.tsv", delimiter='\t')
# test = pd.read_csv("test.tsv", delimiter='\t')
CPU times: user 8.56 s, sys: 728 ms, total: 9.29 s
Wall time: 9.3 s
In [3]:
print(train.shape)
print(train.info())
display(train.head())
(1482535, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   train_id           1482535 non-null  int64  
 1   name               1482535 non-null  object 
 2   item_condition_id  1482535 non-null  int64  
 3   category_name      1476208 non-null  object 
 4   brand_name         849853 non-null   object 
 5   price              1482535 non-null  float64
 6   shipping           1482535 non-null  int64  
 7   item_description   1482529 non-null  object 
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB
None
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
In [4]:
print(test.shape)
print(test.info())
display(test.head())
(693359, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693359 entries, 0 to 693358
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   test_id            693359 non-null  int64 
 1   name               693359 non-null  object
 2   item_condition_id  693359 non-null  int64 
 3   category_name      690301 non-null  object
 4   brand_name         397834 non-null  object
 5   shipping           693359 non-null  int64 
 6   item_description   693359 non-null  object
dtypes: int64(3), object(4)
memory usage: 37.0+ MB
None
test_id name item_condition_id category_name brand_name shipping item_description
0 0 Breast cancer "I fight like a girl" ring 1 Women/Jewelry/Rings NaN 1 Size 7
1 1 25 pcs NEW 7.5"x12" Kraft Bubble Mailers 1 Other/Office supplies/Shipping Supplies NaN 1 25 pcs NEW 7.5"x12" Kraft Bubble Mailers Lined...
2 2 Coach bag 1 Vintage & Collectibles/Bags and Purses/Handbag Coach 1 Brand new coach bag. Bought for [rm] at a Coac...
3 3 Floral Kimono 2 Women/Sweaters/Cardigan NaN 0 -floral kimono -never worn -lightweight and pe...
4 4 Life after Death 3 Other/Books/Religion & Spirituality NaN 1 Rediscovering life after the loss of a loved o...
In [5]:
# データ数を取得
data_counts = [len(train), len(test)]
labels = ['Train Data', 'Test Data']

# 円グラフを作成
plt.figure(figsize=(8, 6))

# カスタムラベルを作成
def func(pct, allvalues):
    absolute = int(pct / 100. * sum(allvalues))
    return f'{pct:.1f}%\n({absolute})'

plt.pie(data_counts, labels=labels, autopct=lambda pct: func(pct, data_counts), startangle=90)
plt.title('Train and Test Data Counts')
plt.axis('equal')  # 円を真円にする
plt.show()
No description has been provided for this image

欠損値¶

In [6]:
print(train.isnull().sum())
train_id                  0
name                      0
item_condition_id         0
category_name          6327
brand_name           632682
price                     0
shipping                  0
item_description          6
dtype: int64
In [7]:
print(test.isnull().sum())
test_id                   0
name                      0
item_condition_id         0
category_name          3058
brand_name           295525
shipping                  0
item_description          0
dtype: int64
In [8]:
# 欠損値の個数を計算
missing_data_train = {
    'category_name': train['category_name'].isnull().sum(),
    'brand_name': train['brand_name'].isnull().sum(),
    'item_description': train['item_description'].isnull().sum()
}

total_rows = len(train)

# 各列の欠損値の割合を計算
missing_percentage = {k: (v / total_rows) * 100 for k, v in missing_data_train.items()}

# 横並びのサブプロットを作成
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# 欠損値をプロットする関数
def plot_missing_data(ax, column_name, missing_count, missing_percent):
    labels = ['Missing', 'Not Missing']
    sizes = [missing_count, total_rows - missing_count]
    explode = (0.1, 0)  # "Missing"のスライスを少し外側に表示

    ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
    ax.set_title(f"{column_name} Missing Values\nCount: {missing_count} ({missing_percent:.2f}%)")
    ax.axis('equal')  # 円を真円にする

# 各列ごとに欠損値をプロット
for ax, column in zip(axes, missing_data_train.keys()):
    plot_missing_data(ax, column, missing_data_train[column], missing_percentage[column])

plt.tight_layout()  # レイアウトを調整して重ならないようにする
plt.show()
No description has been provided for this image
In [9]:
# 欠損値の個数を計算
missing_data_test = {
    'category_name': test['category_name'].isnull().sum(),
    'brand_name': test['brand_name'].isnull().sum()
}

total_rows_test = len(test)

# 各列の欠損値の割合を計算
missing_percentage_test = {k: (v / total_rows_test) * 100 for k, v in missing_data_test.items()}

# 横並びのサブプロットを作成
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# 欠損値をプロットする関数
def plot_missing_data(ax, column_name, missing_count, missing_percent):
    labels = ['Missing', 'Not Missing']
    sizes = [missing_count, total_rows_test - missing_count]
    explode = (0.1, 0)  # "Missing"のスライスを少し外側に表示

    ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
    ax.set_title(f"{column_name} Missing Values\nCount: {missing_count} ({missing_percent:.2f}%)")
    ax.axis('equal')  # 円を真円にする

# 各列ごとに欠損値をプロット
for ax, column in zip(axes, missing_data_test.keys()):
    plot_missing_data(ax, column, missing_data_test[column], missing_percentage_test[column])

plt.tight_layout()  # レイアウトを調整して重ならないようにする
plt.show()
No description has been provided for this image

'price'¶

In [10]:
train['price'].describe()
Out[10]:
count    1.482535e+06
mean     2.673752e+01
std      3.858607e+01
min      0.000000e+00
25%      1.000000e+01
50%      1.700000e+01
75%      2.900000e+01
max      2.009000e+03
Name: price, dtype: float64
In [11]:
plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price+', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution', fontsize=17)

plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log (price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log (Price) Distribution', fontsize=17)
plt.show()
No description has been provided for this image
In [12]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=train['price'], color='salmon')
plt.title('Box Plot of Prices')
plt.xlabel('Price')
plt.grid()
plt.show()
No description has been provided for this image

'name'¶

In [13]:
# 商品名の長さを計算
train['name_length'] = train['name'].str.len()

# ヒストグラムを作成
plt.figure(figsize=(10, 6))
plt.hist(train['name_length'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Product Name Lengths')
plt.xlabel('Length of Product Name')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()
No description has been provided for this image
In [14]:
%%time
# 商品名を結合
text = ' '.join(train['name'])

# ワードクラウドを生成
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# ワードクラウドを表示
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Product Names')
plt.show()
No description has been provided for this image
CPU times: user 34.5 s, sys: 1.86 s, total: 36.4 s
Wall time: 36.1 s
In [15]:
# 商品名の長さと価格の関係
plt.figure(figsize=(10, 6))
plt.scatter(train['name_length'], train['price'], alpha=0.5)
plt.title('Product Name Length vs Price')
plt.xlabel('Length of Product Name')
plt.ylabel('Price')
plt.grid()
plt.show()
No description has been provided for this image

'item_conditon_id'¶

In [16]:
# item_condition_id のカウントを取得
condition_counts = train['item_condition_id'].value_counts().sort_index()

# 棒グラフを作成
plt.figure(figsize=(8, 6))
condition_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Products by Item Condition')
plt.xlabel('Item Condition ID')
plt.ylabel('Number of Products')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()
No description has been provided for this image

'category_name'¶

In [17]:
# TOP 5 RAW CATEGORIES
train['category_name'].value_counts()[:5]
Out[17]:
category_name
Women/Athletic Apparel/Pants, Tights, Leggings    60177
Women/Tops & Blouses/T-Shirts                     46380
Beauty/Makeup/Face                                34335
Beauty/Makeup/Lips                                29910
Electronics/Video Games & Consoles/Games          26557
Name: count, dtype: int64
In [18]:
def split_cat(text):
    try: return text.split("/")
    except: return ("No Label", "No Label", "No Label")
In [19]:
%%time
train['general_cat'], train['subcat_1'], train['subcat_2'] = \
zip(*train['category_name'].apply(lambda x: split_cat(x)))
train.head()
CPU times: user 5.45 s, sys: 259 ms, total: 5.71 s
Wall time: 5.71 s
Out[19]:
train_id name item_condition_id category_name brand_name price shipping item_description name_length general_cat subcat_1 subcat_2
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet 35 Men Tops T-shirts
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ... 32 Electronics Computers & Tablets Components & Parts
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol... 14 Women Tops & Blouses Blouse
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]... 21 Home Home Décor Home Décor Accents
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity 20 Women Jewelry Necklaces
In [20]:
x = train['general_cat'].value_counts().index.values.astype('str')
y = train['general_cat'].value_counts().values
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))]
In [21]:
trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title= 'Number of Items by Main Category',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Category'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)
In [22]:
x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
y = train['subcat_1'].value_counts().values[:15]
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))][:15]
In [23]:
trace1 = go.Bar(x=x, y=y, text=pct,
                marker=dict(
                color = y,colorscale='Portland',showscale=True,
                reversescale = False
                ))
layout = dict(title= 'Number of Items by Sub Category (Top 15)',
              yaxis = dict(title='Count'),
              xaxis = dict(title='SubCategory'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

価格設定(価格の対数)の観点から見ると、すべてのカテゴリがかなり適切に分散されており、特別な価格設定のカテゴリはない。

In [24]:
general_cats = train['general_cat'].unique()
x = [train.loc[train['general_cat']==cat, 'price'] for cat in general_cats]
In [25]:
data = [go.Box(x=np.log(x[i]+1), name=general_cats[i]) for i in range(len(general_cats))]
In [26]:
layout = dict(title="Price Distribution by General Category",
              yaxis = dict(title='Category'),
              xaxis = dict(title='Frequency'))
fig = dict(data=data, layout=layout)
py.iplot(fig)

'brand_name'¶

In [27]:
x = train['brand_name'].value_counts().index.values.astype('str')[:10]
y = train['brand_name'].value_counts().values[:10]
In [28]:
trace1 = go.Bar(x=x, y=y, 
                marker=dict(
                color = y,colorscale='Portland',showscale=True,
                reversescale = False
                ))
layout = dict(title= 'Top 10 Brand by Number of Items',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Brand Name'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

'shipping'¶

送料負担は売り手と買い手の間で適切に分割されており、商品の送料の半分以上(55%)は売り手が負担している。さらに、送料を支払う必要があるユーザーが支払う平均価格は、追加の送料を必要としないユーザーよりも低くなっている。これは、売り手が追加の送料を補うために価格を下げていることを意味する。

In [29]:
train.shipping.value_counts()/len(train)
Out[29]:
shipping
0    0.552726
1    0.447274
Name: count, dtype: float64
In [30]:
# shipping のカウントを取得
shipping_counts = train['shipping'].value_counts()

# 円グラフを作成
plt.figure(figsize=(8, 6))
plt.pie(shipping_counts, labels=['Buyer Pays', 'Seller Pays'], autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Shipping Cost Responsibility')
plt.axis('equal')  # 円を真円にする
plt.show()
No description has been provided for this image
In [31]:
prc_shipBySeller = train.loc[train.shipping==1, 'price']
prc_shipByBuyer = train.loc[train.shipping==0, 'price']
In [32]:
# グラフの作成
fig, ax = plt.subplots(figsize=(20, 10))
ax.hist(np.log(prc_shipBySeller + 1), color='#8CB4E1', alpha=1.0, bins=50,
         label='Seller pays Shipping')
ax.hist(np.log(prc_shipByBuyer + 1), color='#007D00', alpha=0.7, bins=50,
         label='Buyer pays Shipping')

# タイトルとラベルの設定
ax.set(title='Histogram Comparison', ylabel='% of Dataset in Bin')
plt.xlabel('log (price + 1)', fontsize=17)
plt.ylabel('Frequency', fontsize=17)
plt.title('Price Distribution by Shipping Type', fontsize=17)
plt.tick_params(labelsize=15)

# 凡例の表示
ax.legend(fontsize=15)

# グラフの表示
plt.show()
No description has been provided for this image

'item_description'¶

非構造化データであるため、解析がより困難になる。
より詳細で長い説明をすると、入札価格が高くなる?
すべての句読点を削除し、一部の英語のストップワード (つまり、「a」、「the」などの冗長な単語) と、長さが 3 未満のその他の単語を削除する。

In [33]:
def wordCount(text):
    """
    Parameters:
      text(str): 商品名、商品の説明文
    """
    try:
        if text == 'No description yet':
            return 1
        else:
            text = text.lower()
            regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
            text = regex.sub(" ", text)

            words = [w for w in text.split() if w and w not in ENGLISH_STOP_WORDS and len(w) > 3]
            return len(words)
    except: 
        return 0
In [34]:
%%time
# add a column of word counts to both the training and test set
train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))
test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))
CPU times: user 31.5 s, sys: 2.8 ms, total: 31.5 s
Wall time: 31.5 s
In [35]:
train.head()
Out[35]:
train_id name item_condition_id category_name brand_name price shipping item_description name_length general_cat subcat_1 subcat_2 desc_len
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet 35 Men Tops T-shirts 1
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ... 32 Electronics Computers & Tablets Components & Parts 14
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol... 14 Women Tops & Blouses Blouse 8
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]... 21 Home Home Décor Home Décor Accents 14
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity 20 Women Jewelry Necklaces 3
In [36]:
df = train.groupby('desc_len')['price'].mean().reset_index()
In [37]:
trace1 = go.Scatter(
    x = df['desc_len'],
    y = np.log(df['price']+1),
    mode = 'lines+markers',
    name = 'lines+markers'
)
layout = dict(title= 'Average Log (Price) by Description Length',
              yaxis = dict(title='Average Log (Price)'),
              xaxis = dict(title='Description Length'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)
In [38]:
# ref. Mercari Interactive EDA + Topic Modelling
# https://www.kaggle.com/code/thykhuely/mercari-interactive-eda-topic-modelling